Skip to content

fix(robustness): atomic batch writes + DLQ fsync (P0)#59

Merged
aksOps merged 4 commits into
mainfrom
fix/robustness-p0
Apr 28, 2026
Merged

fix(robustness): atomic batch writes + DLQ fsync (P0)#59
aksOps merged 4 commits into
mainfrom
fix/robustness-p0

Conversation

@aksOps

@aksOps aksOps commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Summary

P0 robustness items surfaced by a structured two-voice brainstorm with codex (gpt-5-codex). Two passes: codex critiqued my opening positions, I pushed back on three of codex's downgrades, both sides did file-grounded verification of every claim. Of the original 4 P0 items, 2 ship here and 2 are deferred with documented rationale (the dialogue revealed they were based on flawed mental models of the actual code).

Ships

  • Pipeline orphan FK rows — new Repository.BatchCreateAll persists Trace→Span→Log inside a single db.Transaction. A panic between two BatchCreate* calls used to leak orphan parent rows; now the entire batch rolls back atomically. Trace insert preserves the existing ON CONFLICT DO NOTHING idempotency, so DLQ replay re-attempts cleanly.
  • DLQ Enqueue f.Sync() — adds fsync before close so a host crash between Write and Close can't leave a torn JSON file that permanently consumes a retry slot until DLQ_MAX_RETRIES evicts it.

Behavior change worth a second look

TestPipeline_FailedTracesContinuesToSpans is renamed/rewritten as TestPipeline_FailedTracesAbortsBatch. Trace insert errors now abort the batch instead of being "tolerated". The previous tolerance was a fragile workaround — traces are idempotent on retry, so a DLQ replay re-attempts cleanly. TestPipeline_FailedSpansSkipsLogs (orphan-log invariant) preserved.

Deferred (documented, not silent)

  • DLQ replay poisoned-pill via ON CONFLICT DO NOTHINGSpan, Log, MetricBucket have no unique-index beyond auto-increment id, so the clause is a no-op there. Real fix needs a schema migration (uniqueIndex on (tenant_id, trace_id, span_id) for spans, plus a pre-migration dedup on existing rows). Out of scope for P0; needs its own design pass.
  • GraphRAG refresh shadow-swap — codex framed this as "multi-second write-lock stall every 60s". Code reading shows rebuildFromDBForTenant does many tiny Upsert* calls (microseconds each, lock briefly per call), not a single bulk write-lock. Contention is distributed, not bursty. Demoted to a P2 optimization, deferred to a future PR if metrics show it's needed.

Skipped at the brainstorm stage (verification rejected each)

  • MCP semaphore-on-panic (PR fix(post-review): H1-H4 + C1 from deep code review #56's defer release() inside runWithTimeout already covers it)
  • Drain template startup race (reload is synchronous in New() before Start() spawns workers)
  • WS slow-client (per-client buffer is 256 + non-blocking eviction in flush())
  • MCP cache invalidation (5s TTL is by-design)
  • DB query deadline at repo layer (ctx already propagates; gap is at handler middleware)
  • GraphRAG TraceStore restart loss (DB is source of truth; operability only)

Test plan

  • go vet ./... clean
  • go build ./... succeeds
  • go test ./... -race -count=1 — 404 tests pass across 27 packages (pre-existing flake on TestMCP_TenantIsolation_AllGraphRAGTools under load surfaced once, passed on rerun and in isolation; flagged as a follow-up to investigate)
  • TestPipeline_FailedTracesAbortsBatch and TestPipeline_FailedSpansSkipsLogs both pass under -race
  • commit signed (ED25519)

🤖 Generated with Claude Code

aksOps and others added 4 commits April 28, 2026 01:15
P0 robustness items from the Round-2 brainstorm. Two of the four originally
proposed items are deferred with rationale below.

**1. Pipeline orphan FK rows (P0).** A worker panic between BatchCreateTraces
and BatchCreateSpans previously left a parent trace row with no children
(or vice versa). New Repository.BatchCreateAll persists Trace→Span→Log inside
a single db.Transaction; any failure (or panic via the worker's defer recover)
rolls back the entire batch atomically. Trace insert preserves the existing
ON CONFLICT DO NOTHING idempotency.

Behavior change: trace insert errors now abort the batch instead of being
"tolerated". The previous tolerance was a fragile workaround — traces are
idempotent on retry, so a DLQ replay re-attempts cleanly. Test
TestPipeline_FailedTracesContinuesToSpans renamed/rewritten to
TestPipeline_FailedTracesAbortsBatch reflecting the new contract.
TestPipeline_FailedSpansSkipsLogs (orphan-log invariant) preserved.

**2. DLQ Enqueue f.Sync() (P0).** Adds f.Sync() before f.Close() so a host
crash between Write and Close can't leave a torn JSON file that permanently
consumes a retry slot until DLQ_MAX_RETRIES evicts it. (Edit landed on
the prior checkpoint commit on this branch.)

**Deferred from the brainstorm with rationale:**

- "DLQ replay poisoned-pill via per-row ON CONFLICT DO NOTHING" — Spans, Logs,
  and MetricBuckets have no unique-index beyond auto-incrementing id, so
  ON CONFLICT DO NOTHING is a no-op there. Real fix needs a schema migration
  to add uniqueIndex on (tenant_id, trace_id, span_id) for spans (plus a
  pre-migration dedup pass on existing rows). Out of scope for P0; needs
  its own design pass.

- "GraphRAG refresh shadow-swap to fix multi-second write-lock stall" —
  refresh.go does many tiny Upsert* calls (microseconds each, lock briefly
  per call), not a single bulk write-lock window. The "multi-second stall"
  framing didn't survive code reading; actual contention is distributed,
  not bursty. Demoted to P2 optimization, deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
4.0% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@aksOps aksOps merged commit e3c87f9 into main Apr 28, 2026
16 of 17 checks passed
@aksOps aksOps deleted the fix/robustness-p0 branch April 28, 2026 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant